OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Building a test collection for complex document information processing

Internal identifier: 001178 (Main/Exploration); previous: 001177; next: 001179

Authors: D. Lewis [United States]; G. Agam [United States]; S. Argamon [United States]; O. Frieder [United States]; D. Grossman [United States]; J. Heard [United States]

RBID : Pascal:06-0519568

Abstract

Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Building a test collection for complex document information processing</title>
<author>
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>David D. Lewis Consulting</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Heard, J" sort="Heard, J" uniqKey="Heard J" first="J." last="Heard">J. Heard</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">06-0519568</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0519568 INIST</idno>
<idno type="RBID">Pascal:06-0519568</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000362</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000424</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000348</idno>
<idno type="wicri:Area/Main/Merge">001209</idno>
<idno type="wicri:Area/Main/Curation">001178</idno>
<idno type="wicri:Area/Main/Exploration">001178</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Building a test collection for complex document information processing</title>
<author>
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>David D. Lewis Consulting</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Heard, J" sort="Heard, J" uniqKey="Heard J" first="J." last="Heard">J. Heard</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>Dept. of Computer Science Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
<sZ>6 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Data analysis</term>
<term>Data mining</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Document structure</term>
<term>Information access</term>
<term>Information extraction</term>
<term>Information processing</term>
<term>Information retrieval</term>
<term>Information technology</term>
<term>Optical character recognition</term>
<term>Search system</term>
<term>Signature analysis</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Recherche information</term>
<term>Traitement document</term>
<term>Traitement information</term>
<term>Accès information</term>
<term>Technologie information</term>
<term>Fouille donnée</term>
<term>Analyse donnée</term>
<term>Extraction information</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance caractère</term>
<term>Analyse documentaire</term>
<term>Système recherche</term>
<term>Structure document</term>
<term>Analyse signature</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Illinois</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Illinois">
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
</region>
<name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<name sortKey="Heard, J" sort="Heard, J" uniqKey="Heard J" first="J." last="Heard">J. Heard</name>
</country>
</tree>
</affiliations>
</record>
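
The record above can also be read programmatically. The following is a minimal sketch in Python, not part of the Dilib toolchain, that pulls the title and the author affiliations out of a shortened copy of the record. It assumes the `wicri:` and `inist:` prefixes are undeclared in the exported record, as they are in the listing above, so a strict XML parser would reject it; the namespace URIs bound here are placeholders invented for this example, and `parse_record` and `list_authors` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the record above (one author kept for brevity).
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Building a test collection for complex document information processing</title>
<author>
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>David D. Lewis Consulting</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URIs: assumptions for this sketch,
# not the real Wicri/INIST namespace identifiers.
NS = {"wicri": "urn:example:wicri", "inist": "urn:example:inist"}


def parse_record(text):
    """Parse a record whose wicri:/inist: prefixes are not declared."""
    decls = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    # Inject the declarations into the opening <record> tag only.
    patched = re.sub(r"<record\b", f"<record {decls}", text, count=1)
    return ET.fromstring(patched)


def list_authors(root):
    """Return (name, institution, country) for each author in titleStmt."""
    rows = []
    for author in root.iterfind(".//titleStmt/author"):
        name = author.findtext("name")
        institution = author.findtext("affiliation/inist:fA14/s1", namespaces=NS)
        country = author.findtext("affiliation/country")
        rows.append((name, institution, country))
    return rows


root = parse_record(RECORD)
title = root.findtext(".//titleStmt/title")
authors = list_authors(root)

print(title)
for row in authors:
    print(row)
```

The search is restricted to `titleStmt` because the same author list is repeated under `sourceDesc/biblStruct/analytic`; querying `.//author` on the full record would return each author twice.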

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001178 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001178 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:06-0519568
   |texte=   Building a test collection for complex document information processing
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024